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Abstract 

In this paper, we are interested in the estimation and prediction of a parametric model on a 
short dataset upon which it is expected to overfit and perform badly. To overcome the lack of 
data (relatively to the dimension of the model) we propose the construction of an informative 
hierarchical Bayesian prior based upon another longer dataset which is assumed to share some 
similarities with the original, short dataset. We apply the methodology to a working model 
for the electricity load forecasting on both simulated and real datasets, where it leads to a 
substantial improvement of the quality of the predictions. 

Keywords: informative prior, hierarchical prior, mcmc algorithms, short dataset, electricity 
load forecasting 



1 Introduction 

Modelling and forecasting electricity loads is a problem well-known within both the academic 
and the applied statistics community (see e.g. |Brmn and Farmer} 1985^ . The signals studied 



usually exhibit strong properties such as seasonalities or weekly and daily profiles, leading to 
some very accurate models that tend to perform rather well under normal forecasting conditions. 
The approaches used to model and forecast them vary a lot: we mention a couple of them in the 
lines below. 

Some authors worked with imivariate time series models: [Taylorl(2003) built a double seasonal 



exponential smoothing for the British electricity load, while Taylor et al.^ (2006, 1 and Taylor and 



McSharry < |2007[| pres ented some comparative studies between univariate methods for different 



sets of data. Cugliari ( 2011) opted for a non-parametric approach relying on the wavelet transform 



to forecast the load curve seen as a functional-valued autoregressive Hilbertian process. Others 
tried and modelled the load together with the use of exogenous variables: [Harvey and Koopman 



( [1993^ included the temperature in their model, which inspired the Bayesian semi-parametric 
regression model found in [Smit h' ("2000 1. 

Alternatives to univariate modelling were often considered too, such as building multiple- 
equation models: while the various hours of the day share the same equation, the associated 
parameters differ from one another. Scares and Medeiros (2008) built an hourly independent 



seasonal auto-regressive model for their data, and Ramanathan et al. (1997) also built an rnde- 
pendent model for each hour of the day but took temperature effects into account. A state-space 
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model is proposed in Dordonnat et al. (2008 1 and Dordonnat (2009) where the parameters of the 



model are also allowed to vary along the time. 

The exogenous variables most commonly used to forecast the electricity load are weather- 
based, even though the decision to include one or more meteorological variables into a model 
may be open to discussion. The period of forecasting has to be taken into account, as well as 
the accuracy of the predictions for such variables. For temperate climates, the most important 
meteorological factor is the temperature (see e.g. |Taylor and Buizza 2003 1|. For the French 



electricity load specifically, the importance of the temperature and cloud-cover was underlined in 
Menage et al.' (1988 1 and Bruhns et al. ( 2005| . Other weather-related models include the works of 



Engle et al. (1986l;,Ramanathan et al.|(|1997| and CotteFand Smith] ([2003); [Smith and Kohn 



within the Bayesian framework. Let us also mention machine learning work: Hippert et al. 



(20021 
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for a neural networks implementation and 'Goude ( 2008) for a detailed study of online mixing 



algorithms used on a set of various predictors. 

We are interested in the development of a methodology to improve the estimation and the 



predictions of a parametric multi-equation model (similar to the one presented in Bruhns e t al.[ 
(2005^ ) over a short dataset. The limited size of the dataset coupled with the high dimensionality 
of the model leads to a typical overfitting situation when using the maximum likelihood approach: 
the fitted values are relatively close to the observations while the errors in prediction are an 
order of magnitude larger or more (note that due to the very periodic nature of its regressors, 
the model typically requires 4 or 5 years of data to provide satisfactory predictions). This 
overfitting behaviour can be somewhat alleviated by the use of a Bayesian estimation relying 
on an informative prior distribution, but the very fact that the data available is limited makes 
the posterior distribution all the more sensitive to the choice of that prior. Although electricity 
load curves may largely differ from one population to another, they may also share some 
common features. The latter case is expected to happen when the global population studied is 
an aggregation of non-homogeneous subpopulations for which the estimations are made harder 
due to the relative lack of data. 

To design a sensible prior in such a situation, we consider the case where another long dataset 
is available, upon which the model performs equally well in both estimation and prediction. 
We assume the long and the short datasets are somehow similar in a non-obvious way. That 
the similarity between the parameters underlying the two datasets (we will assume they are 
indeed coming from the model considered) cannot be easily guessed prevents us from trying 
to model the datasets simultaneously because it would require a rather precise knowledge of 
the link between the two. We propose a general way of building an informative hierarchical (see 



Gelman and Hill 2007} for a general review on the subject of hierarchical models) prior for the 



short dataset from the long one that goes as follows: 

1. we first estimate the posterior distribution on the long dataset using a non-informative 
prior, arguing that the design of an informative prior for this dataset is not necessary, since 
the data available is enough to estimate and predict the model in this case ; 

2. we extract key pieces of information from this estimation (e.g. moments) to design an 
informative prior for the short dataset which takes into accoimt the prior information that 
the datasets are somehow similar, via the introduction of hyperparameters designed to 
model and estimate this similarity. 

The paper is organised as follows. In Section [2| we focus on the general methodology 
and describe the way we carried our experimentation, we also present the general regression 
model used for our tests and applications. In Section [3j we present the semi-conjugated priors 
(informative and non-informative) used on each of the datasets. The ad hoc MCMC algorithms 
we developed to estimate the mean and variance of the posterior distributions are push backed 
into the appendix so as not to obfuscate the main point of the paper by technical details. In 
Section [Ij we use these algorithms to illustrate and validate our approach in simulated situations: 
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we show the contribution of the informative prior over the precision of both the estimated 
parameters and the forecasts in the case of a working electricity load forecasting model. In 
Section |5j we apply our method to French electricity datasets and compare the results with the 
outputs of 3 alternative standard methods to assess its competitivity. We also study the effect 
of the small dataset's sample size upon the predictive quality of the model and show that the 
informative prior provides reasonable forecasts even when the lack of data is severe. 

2 Methodology 
2.1 General principle 

Let us define here some notations that we shall keep throughout this paper. Hereafter, we denote 
B a short dataset over which we would like to estimate the model and we denote A a long dataset 
known or thought to share some common features with B. We will denote 9 the parameters of 
the model and the observations from A. 

We propose a method designed to help improve parameter estimations and model predictions 
over B with the help of A. Let tt-^ be the prior distribution used on A and TZ'^{-\y^) the 
associated posterior distribution. Observe that the choice of n-^ is not crucial as long as it 
remains non-informative enough because the model can be correctly estimated from the data 
alone on A. Let tt^ denote the prior distribution to be used on B. Notice that the naive pick 
jj-B _ jjA^.^yA'j jg j-(q|- g viable solution as soon as the parameters of A and B differ since the 
variance of the posterior distribution T['^{-\y^) is too small: in that case the data of B will not 
be able to make up for the difference between the posterior mean on A and the true value of 
the parameters of B. Assuming that the parameters corresponding to A and B are identical is 
too restrictive in practise. To allow for more flexibility we add hyperparameters accounting for 
the similarity between the datasets. We now described the informative hierarchical prior we 
designed. 

Assume that the prior distribution is to be chosen within the parametric family 

T = {tta; a e A}. 

Since selecting tt^ E J- is equivalent to picking A^ G A, and since we want /rg to retain some 
key-features of TZ-^{-\y'^), we want to pick A^ using some of the information contained inside 
the posterior distribution obtained on A. We assume that there exists an operator T : J- ^ A, 
such that 

Tin;,] = A, 

and choose A.^ proportional to T[7r'^(-|i/'^)], in the sense that 

A'^ = KTin^i-iy^)], 

where K : A ^ A itself is an unknown linear operator that we assume diagonal for ease of use. 

The operator K can be interpreted as a similarity operator between A and B, and its diagonal 
components as similarity coefficients measuring how close the two datasets really are when 
looked at through T. The diagonal components of K are hyperparameters of the prior we 
designed, and we give them a vague hierarchical prior distribution centred around q, the prior 
on q being vague and centred around 1 . 

The hyperparameter q may also be regarded as a more global similarity coefficient, since 
it represents the mean of all the similarity coefficients. The prior mean of q is forced to 1 to 
reflect the prior knowledge that the datasets are somehow similar. The variance of the prior 
distribution of q could in theory be reduced, going from a vague prior to a more informative 
structure, depending on the confidence we have over the similarity between the datasets. We 
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chose not to however, so as to keep the procedure we describe from requiring any deUcate 
subjective adjustments. 

We present now two frequent situations where the above procedure can be written in a 
simpler way. 

Example 1 (Method of Moments). We assume that the elements of T can he identified via their m 
first moments: the operator T can then be reduced to a function F of the m first moments operators, i.e. 
A = T[tix] = F(E(0), . . . ,E(0'")). The expression of then becomes 

=KF(E(0|y-^),...,E(0'"|y-^)). 

Note that, if the prior requires the specification of at least the two first moments, even though the priors 
from the upper layers of the model are vague, the correlation matrix estimated on the dataset A remains 
untouched and is directly plugged into in the informative prior if we consider centred moments for orders 
greater than 1. 

Example 2 (Conjugacy). We consider the case where T is the family of priors conjugated for the model. 
If the prior belongs to T then the associated distribution n^{-\y^') does too and there corresponds a 
parameter A-^(i/-^) to it. The expression of thus reduces to 

A^ = KA-^(y-^). 



2.2 Description of the model for the electricity load 



Modelling and forecasting the electricity load (or demand) on a day-to-day basis has long been a 
key activity for any company involved in the electricity industry. It is first and foremost needed 
to supply a fixed voltage at all ends of an electricity grid: to be able to do so, the amount of 
electricity produced has to match the demand very closely at any given time and experts usually 



make use of short-term forecasts with this aim in view as mentioned in Cottet and Smith, (2003) . 

Electricity load usually has a large predictable component due to its very strong daily, weekly 
and yearly periodic behaviour. It has also been noted in many regions that the weather usually 
affects the load too, the most important meteorological factor typically being the temperature 

for an example). 



(see Al-Zayer and Al-Ibrahim 

For each of the 48 instants of the day (each instant lasts 30 minutes, starting from 00:00am), 



the non-linear regression model that we consider in this paper, first described in Bruhns et al. 



< |2005^ , is made of three components, which we explain briefly in the next paragraphs, and is 
usually formulated as follows: for t = 1, . . . ,N, 



Z^^°^COS 



+ et 

Ijn 
365.25 
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365.25 



■E^;ln/0. 



where yt is the load of day t and where ei, . . . , are assumed independent and identically 
distributed with common distribution J\f{0, cr^). 

The x(^^ component is meant to account for the average seasonal behaviour of the electricity 
load, with a truncated Fourier series (whose coefficients are z™^ G IR and zj^ G M) and gaps 
(parameters coj £ R) which represent the average levels of electricity load over predetermined 
periods given by a partition {'^j) j^{\^,„^di2} °f the calendar. This partition usually specifies 
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Figure 1: Left : French Electricity load from 13/06/2005 to 29/06/2005 (in grey) and from 
05/12/2005 to 11/12/2005 (in black). The load is expressed in MW. Notice the daily patterns of 
the electricity load are not the same during summer and winter Right : French Electricity load at 
10:00 over 5 years against temperatures. The load seems to increase linearly with the temperature 
below a certain threshold. 



holidays, or the period of time when daylight saving time is in effect i.e. major breaks in the 
electricity consumption behaviour The left part of Figure [T| shows a typical behaviour over two 
different periods of time (summer vs. winter). 

The x(^) component allows for day-to-day adjustments of the seasonal behaviour x'^) through 
shapes (parameters ipj) that depends on the so-called days' types which are given by a second 
partition (Y^)yg|2 of the calendar This partition usually separates weekdays from weekends, 
and bank holidays. The differences between two different daytypes are visible on the left part of 
Figure [Tjtoo. For obvious identifiability reasons, the vector xp is restricted to the positive quadrant 
of the II ■ l|i-imit sphere in R'^s, that we denote 

5^^(0,1) = {!/.e(R+)''2; Mi=^}- 

The x('^) component represents the non-linear heating effect that links the electricity load to 



the temperature (see Seber and Wild} 2003 for a general presentation of non-linear models), with 
the help of 2 parameters. The heating threshold u & [u^u] corresponds to the temperature above 
which the heating effect is considered null and is usually estimated to be roughly around 15A°C. 
The heating effect is supposed to be linear for temperatures below the threshold and null for 
temperatures above. The restriction on the support of the threshold u simply expresses the fact 
that the threshold is sought within the range of the observed temperatures, i.e. u e [u, u] with 



min Tf < u < u < max Tf. 

t=l,...,N t=l,...,N 



The heating gradient 7 G IR* where M* = ]R\{0} represents the intensity of the heating effect, i.e. 
the slope (assumed to be non-zero) of the linear part that can be observed on the right part of 
Figure [1] 

Using the notation M,, for the i-th row of a matrix M, the previous model can be re-written 
in the following condensed and more generic way: for f = 1, . . . , N, 



yt = (A,.a)(Bf.^ + Cf)+7(Tf-M)l[T,,+co[(")+ef- 



(2) 
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The matrices A of size N x rf^, B of size N x dp, C of size N x 1, and T of size N x 1 are known 
exogenous variables while the parameters of the model to be estimated are 

= {oc,^,j,u,a^) e R''" X B+''(0,1) x M* x [u, u] x R% 

where b'^/{0,1) = e ||^||i < 1} is the positive quadrant of the || ■ ||i-unit ball of 

dimension dp. 

One could, without too much difficulty, add a cooling effect to the model, whose definition 
would be similar to that of the heating effect. Since the cooling effect remains far less important 
than the heating effect in France at the present time (see the right part of Figure [ij, and since 
the estimation of the associated cooling threshold is often unstable at best, we felt that adding 
such a part to the model was not as crucial as it would be in other countries where the cooling 
effect plays a much more important role. For the applications presented in Section |5] of this paper, 
we thus went for a simpler implementation: given a cooling threshold u'^, a regressor whose 
coordinates are (Tt — u'^)lj_^^j^j{u'^), for t = 1, . . . ,n is added to the matrix A. It models, in 
practise, a cooling effect with a fixed cooling threshold and an estimated cooling gradient that is 
multiplied by the daytype effect. 

Considering the expression in ||2}, the model is quite general since the bulk of it could be 
thought of as the product of two linear regressions, with the added twist of a non-linearity 
introduced via the threshold parameter u (change-point of the model). Even though the priors 
and algorithms constructed in the coming Sections do depend on the model introduced here, 
they can be modify in a rather straightforward manner, should the reader want to tweak the 
model a bit (for example to add another exogenous variable such as the wind or the cloud-cover). 

Hereafter L{y\9) will denote the likelihood of the observations y = (i/i, . . ■ ,yN)- We will often 
write the model for t — 1, . . . , N as 

yt=Ml)+et (3) 

and use the notation f{t]) = (/i ('/)/■ ■ ■//n('/)) for short, where rj = (a, /5, 7, m) designate the 
parameters of interest. With these notations, since 6 = {t],a^), the likelihood of the model 
described in ||2} reads: 

L{y\e) = [Vl^ay^'exp - fij^)f^ . 

3 Specifications of the priors 
3.1 Informative prior 

Let us denote ji'^ and S-^ the posterior mean and posterior variance of rj from a non-informative 
approach applied to the long dataset A, that we assume have already been collected. For the sake 
of clarity, we drop the notation: when not explicitly specified, the dataset and observations as 
well as the prior and posterior distributions we refer to in this Section will be those corresponding 
toB. 

We now present the informative hierarchical prior for the model (|2}, and then prove that it 
leads to a proper posterior distribution (see Proposition |3|. Following the methodology exposed 
in Section |2.1[ the informative prior that we propose introduces new parameters to model the 
similarity between the two datasets, (A:, Z) e x ]R and {q,r) e R x such that 

f]\k,i ^j^{K}i^,r^i:^) 

k\q,r^M{q{l,...,iy,r-^a) 
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where K = diag(A:). The coordinates of the vector k can be interpreted as similarity coefficients 
between parameters of A and B and the strictly positive scalar / can be seen as a way to 
alternatively weaken or strengthen the covariance matrix as needed. Hyperparameters q and 
r are more general indicators of how close A and B are, q corresponding to the mean of the 
coordinates of k and r being their inverse-variance. /, q, r and of course require a prior 
distribution too. For we use a non-informative prior (we chose n[cr^) = cr~-^) because we do 
not want to make any kind of assumptions about the noise aroimd both datasets. This prior 
is non-informative in the sense that it matches Jeffreys' prior distribution on £7^ for a Gaussian 
linear regression. For the three other parameters, based on semi-conjugacy considerations, we 
use: 

l^g{ai,h^), q^M{l,a^^), r ^ g{ar,hr), (4) 

where fl;, hi, ay, by and cr^ are fixed positive real numbers such that the prior distribution on I, 
q and r are vague. These prior distributions are chosen because of their conjugacy properties 
(as will be seen in the MCMC algorithm). The vagueness requirement that we impose on these 
priors is motivated by the fact that we want to keep as general a framework as possible without 
having to tweak each and every prior coefficient for different applications. 
In the end, the informative prior is built as follows: 

n{6,k,l,q,r) <x n{i]\k,l)Tz{k\q,r)Tz{l)n{q)n{r)Tz{(7^) (5) 

with 

n{f^\k,l) a I'i exp (-^('/ " Kfi^YK^-^rHv " Kfi^] 




n{k\q, r) ex \r\ 2 exp 

7r(/) exp (-f.,/)lR^(/) 

kq"^l^exp {^-^(T~^{q-lf 
n{r) (xr'''-iexp(-b,r)l]R^(r). 
Recalling the notations introduced at the end of Section |2j the posterior measure is given by 

n{e,k,l,q,r\y) « L{y\e)n{0,k,l,q,r) (6) 
c,a-^-^exp (^-^cr~^\\y-fiv)\\i^ 1[o,1]x[h,s]xr;(II^IIi.".'^^) 
X /2 exp (-liv- Kii^)'l{^^)-\r} - V; 



X \r\ 2 exp 



■^^E(^^ -^)') /'"-'exp(-?,,OlR;a) 



X |a--2|z exp f -^(7-2(^-1)2 j r''^-iexp(-V)lR;(r). 



Proposition 3. For (j6, u) E b'^_^ (0, 1) x [u, u] denote A* (j6, u) the matrix whose rows are 



{A,)t,{^,u) = {Bt,^ + Ct)At„{Tt-u)tyj^^+^y{u) 



t = l,...,N, 
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and suppose A'^{b,u)At.{b,u) has full rank for every {f>,u) E B^^(0, 1) x [m, u]. Assume furthermore 
that N > da + 1 and that (i/j, . . . are observations coming from the model Q and the posterior 
measure ||6} is then a well-defined (proper) probability distribution. 

Proof. First notice that J 7t(0,A:,/, (j,r|i/) dtr^ is proportional to 

lly -/('/) ll2"^l[0,i](ll|6|li)l[»,s](")^('?|fc.0^(fck.'')7r(0^(^)^W. 

for almost every y and that the function 6 ^ \\y ~ fiv) Wi^ bounded, for almost every y. The 
posterior integrability is hence trivial as long as n{ri\k,l)n{k\q,r)Tz{l)n{q)n{r) itself is a proper 
distribution which is the case here. □ 



3.2 Non-informative prior 

We now propose a non-informative prior to use with the long dataset A. Note that since the 
dataset A is long enough, the choice of the prior distribution used in this situation does not 
matter much as long as it remains vague enough: the Bayes estimator is expected to converge, 
as the number of observations grows, to the maximum likelihood estimator of the model we 
use. The main issue when studying the asymptotical behaviour of the posterior distribution 
or the maximum likelihood estimator is that the likelihood of the model is not continuously 
differentiable with regard to the heating threshold. Isolating the heating part from the rest 
of the model, Launay et al. ( 2012^ show that this issue can be dealt with and proove among 



other results that both the Bayes estimator and the maximum likelihood estimator are consistent. 
The non-informative prior is thus to be considered hereafter as an equivalent to the maximum 
likelihood approach for all intents and purposes. 

For the sake of clarity again, we drop the notation: when not explicitly specified, the dataset 
and observations as well as the prior and posterior distributions we refer to in Section will be 
those corresponding to A. We show that the use of a non-informative prior distribution leads to 
a proper posterior distribution (see Proposition |4]|. 
We use the following non-informative prior 

7t(0) a tJ-^. 

This prior is non-informative in the sense that it matches Jeffreys' prior distribution on cr^ for a 
Gaussian linear regression and matches Laplace's flat prior on the other parameters. It leads to 
the following posterior distribution 

n{0\y)cxL{y\0)n{e) (7) 
oc (7-^-2 exp (-^(r-^\\y-f{f])\\2^ 1[o,i]x[m,u]xr;(II^IIi'"''^^)- 

Proposition 4. For {fi, u) E b'^'' (0, 1) x [m, u] denote A* (j6, m) the matrix whose rows are 



, t = l,...,N, 



and suppose A'^{b,u)A^:{b,u) has full rank for every (/3, m) G BJ^{0,1) x [m, u]. Assume furthermore 
that N > da + 1 and that (yi, . . . ,yN) are observations coming from the model ||2}, the posterior measure 
|[7| is then a well-defined (proper) probability distribution. 

Proof. Notice first that 

J n{r],a^\y)da^ cx ||y -/(??) Il2~l[ai] (") for almost every y. 



8 



and observe then that 

N -|2 

Ily-/('7)ll2 = E yt-(Bt./5 + Cf)Af.a-(Tf-M)l[7-,,+^[(M)7 . 
f=l 

Let (/3o, Mo) G B^|,''(0, 1) x [m, m] and denote a* = (a, 7). We write 

^ -|2 

ll3/-/((a'/3o,7/"o))ll2 = E ~ (Bf.^o + Cf)At.a- (Tf - mo)1[t,,+oo[("o)7 

= lly- A*(,6o,wo)a*|li 
and thus obtain the following equivalence, as (j6, u) (/5o, wq) and ||a;* II2 ^ +00 

lly -/('?) II2 ^ lly-^*(/5o,"oKll2~ (8) 

The triangular inequality applied to the right hand side of (|8} gives 

lly - A*(^0/"o)a*||2'^ < III1/II2 - ||A*(,6o,Mo)a*||2r^- (9) 
Since A'^ (/5o, mq)^* (|6o/ "o) has full rank, by straightforward algebra we get 

A||a*j|2 < ||A*(^0/"o)a*|l2/ 

where A is the smallest eigenvalue (A* (jSg, mq))'^* (i^O/ "o) and is strictly positive. We can hence 
find an equivalent of the right hand side of l|9} as ||a* II2 — >• +00, which is 



||y||2- ||A,(/3o,Mo)a*||2| ~~A-^/2||^^||^-N_ (10) 



2 

Combining ||8}, l|9| and | (T0) together, we see that the integrability of the left hand side of l|8} 
as (j6, m) ^ (/5o, Wo) and ||a*||2 ^ +00 is directly implied by that of ||a*||2^. The latter is 
of course immediate for N > da- + 1 as can be seen via a quick Cartesian to hyperspherical 
re-parametrisation. 

The previous paragraph thus ensures the integrability of \\y — f{f])\\2^ over sets of the form 

{(,6,m) e V{{^o,uo)), ||a*||2 e]M(,6o,Mo), +~[}, V(^o,"o) e bJ(0,1) x [m, u] 

where the subset V{{bo,U())) is an open neighbourhood of (jSq, mq) and M(/3o, wq) is a real 
number depending on (/3o, mq)- By compacity of B_^'' (0, 1) x [m, m] there exists a finite union of 

such V((/5,v M;)) that covers b'^'' (0, 1) x [m, m]. Denoting M the maximum of M(j6/, m,) over the 
corresponding finite subset of (/5,, «,), we finally obtain the integrability of \\y — fit]) ^ over 

{(^,m) e Bj(0,l),||a,|| e]M, +~[}. 

The integrability of ||y -/(??) 112^^ over {((6,m) e b'^''(0,1), ||a*[| e [0, M]} is trivial, recalling 
that rj ^ l|y~/('/)l|2is continuous and does not vanish over this compact for almost every y, 
meaning its inverse shares these same properties. □ 

Remark 5. The condition "A'^A^ has full rank" mentioned above is typically verified in our applications 
for the regressors used in our model. To see this, call "vector of heating degrees" the vector whose 
coordinates are {Tt — u)l^j^^^^^{u), then not verifying the aforementioned condition is equivalent to 
saying that there exists an index i and a threshold u such that the family of vectors formed by the regressors 
A and the vector of heating degrees is linearly dependent over the subset T,- of the calendar". 
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4 Numerical evaluations of the performance on simulated data 



In this Section we simulate a long dataset A and a short dataset B from the model (|2} to assess 
the performance of the informative prior as the similarity between the datasets varies. To measure 
the improvement brought by the informative prior we compare the estimation and prediction on 
dataset B with a non-rnformative prior For any estimation (posterior mean and variance) on a 
dataset (be it A or B), the MCMC algorithms would typically rim for 500,000 iterations after a 
small burn-in period. 

4.1 Comparing the informative and the non-informative approaches 

Predictive distribution. The Bayesian framework allows us to compute so-called predictive dis- 
tributions, i.e. the distributions of future observations given past observations. Given a prior 
distribution n{6) and the corresponding posterior distribution n{6\i/) related to the past obser- 
vations y = (yi, . . .,yN)/ the predictive distribution for the future observation yw+jt is defined 
as 

g{yN+k\y) ■■= J HyN+k\o)^{s\y)de, 

and the optimal prediction for the risk is then: 

yw+fc := E"[yN+fc|y] (11) 

= J yN+kgiyN+k\y)dyN+k- (12) 

The comparison criterion. To assess the quality of the estimation of the model with our informative 
prior with regard to the estimation of the model with the non-informative prior, we compare 
both results based on the quality of the predictions. Let y^+i be the upcoming observation, the 
prediction error can be written as 

yw+i -yN+i = [yN+i -/n+i('7o)] + [/n+i('?o) -Vn+i], 

which expresses the prediction error as a sum of a noise yjv+i — fN+iiVo) (whose theoretical 
distribution is J\f{0, u^)) and a bias which can be seen as an estimation error over the prediction 
/n+1 ivo) ~ Vn+i ■ We focus solely on the second part, since the first part (the noise) is unavoidable 
in real situation. Given that we want to validate our model on simulated data, the quantity 
/n+i('/o) ~ Vn+i is indeed accessible here whereas it would not be in real situation. 

We thus choose to consider the quadratic distance between the real and the predicted model 
over a year as our quality criterion for a model, i.e.: 



2 365 

3^ J^lfN+iiVo) - VN+iV- (13) 
4.2 Construction of simulated datasets 

Both datasets A and B were simulated according to the model l|2} with d^ = 4 (4 frequencies 
used for the truncated Fourier series). The calendars and the partitions used for A and B were 
designed to include 7 daytypes (^2 = 7, one daytjrpe for each day of the week), but did not 
include any special days such as bank holidays. They also included 2 offsets (di2 = 2) to simulate 
the daylight saving time effect. In the end we thus had ^^=4x2-1-2 = 10 and dp = 6 i.e. d = 19 
using the expression of the model given in ||2|. 



\ 
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Dataset A. We simulated 4 years of daily data for A with parameters: 







= 2, 


seasonal: 




= (27, 7, -3, 1, 5, -1, 4, 0.5, 490, 495), 


shape: 




= (0.13,0.15,0.16,0.16,0.16,0.13), 


heating: 




= -3, 






= 14. 



These values were chosen to approximately mimic the typical electricity load of France up to 
a scaling factor. The temperatures we used for the estimation over A are those measured from 
September 1996 to August 2000 at 10:00AM. 

Dataset B. We simulated 1 year of daily data for B with parameters: 

Vf = 1, . . . , d« 
V/ = 2, . . . , 



where the coordinates of the true hyperparameters k were allowed to vary aroimd 1. The 
temperatures we used for the estimation over B are those measured from September 2000 to 
August 2001 at 10:00AM. 

We also simulated an extra year of daily data B for prediction, with the same parameters 
but using the so-called normal temperatures, meaning that for each day of this extra year the 
temperature is the mean of all the past temperatures at the same time of the year. We made 
such a choice to try and suppress any dependency between our simulated results and the chosen 
temperature for this fictive year of prediction, since we did not want to bias our results because 
of a rigorous winter or an excessively hot summer. 

4.3 Results 

We chose to use vague priors (i.e. proper distributions with large variances) for the uppermost 
layers of our informative hierarchical prior, and thus decided to use the values: 

(Tc, = 10^, ay = br = 10"^ a, = b, ^ 10^^. 

A study of the Bayesian hierarchical model's sensitivity to these values showed that changing 
these hyperparameters to achieve prior variances of greater magnitudes hardly influenced the 
posterior results (means and variances) at all. This is why we decided to stick to these values for 
the remainder of our experimentations. 

Estimation. We benchmarked the Bayesian model with its informative prior against its non- 
informative prior counterpart for different choices of true hyperparameters k over 300 replications 
(data being simulated anew for each replication), i.e. we simulated many different datasets B 
looking more or less similar to A and applied our method on them. Figure |2] shows the posterior 
error of t] (posterior mean minus the true value) of i], based on 300 replications that correspond 
to the case where k^ — k^ = k^y = k^ = 1 i.e f/^ = rj^ for both the informative (leftmost) and 
non-informative (rightmost) method. Marginal confidence interval for the posterior means are 
much smaller when using the informative prior (most of them hitting the true value). The 
marginal posterior standard deviations (not shown here) are also reduced when the informative 
prior is used instead of the non-informative prior. 



CT" =2, 
seasonal: ocf = k^af', 

shape: /5f = kpfi^, fif = /5f , 
heating: = kjj'^, 

U^=kuU^. 
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Figure 2: The posterior error (posterior mean minus true value) of a (seasonal parameters), 
fi (shape parameters), and 7 and u (heating parameters), based on 300 replications. Leftmost 
replications correspond to the informative method while the rightmost replications correspond 
to the non-informative method. Here = = = ku = 1. 
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Figure 4: The posterior error (posterior mean minus true value) of (seasonal parameters), 
based on 300 replications. Top row is for the case were = = kr^ = A:,, = 1 and bottom row 
is for the case where k^ = ku = 1 and ka = kj = 0.5. Leftmost replications correspond to the 
informative method while the rightmost replications correspond to the non-informative method. 
Posterior errors of k^ (shape parameters), and kj and k^ (heating parameters) are not shown 
here because no significant deviation from was foimd on either of these coordinates when the 
informative prior was used in either case (the empirical variances on these coordinates were 
bigger in the non-ideal case though, in a similar fashion to what we observe here for fca). 



When the situation is far from being as ideal as the one mentioned above, the informative 
approach still shows improvement over the non-informative approach but to a lesser extent. 
Figure |3] shows that the estimations of some of the parameters of the model are improved 
with the addition of the prior information (a and u) while some are not (/5 and 7) in the case 
where k^ = ku = 1 and ka — k^ = 0.5. Situations such as k^ = kj = ku = 1 and k^ = 0.5 or 
ka = kj = kjj = 1 and ku = 0.5 were studied too and yielded very similar results i.e. lesser 
improvements on the estimations of some parameters only. Note that when some coordinates of 
k are valued to 0.5 while some are valued to 1, the "similarity" between A and B is very weak. 
The strength or weakness of the similarity between A and B cannot be diagnosed directly from 
the posterior mean of k itself but we will see that the estimations of the hyperparameters q and r 
may provide a partial answer to this question. 

We also estimated the hyperparameters (see Section [3lT| for the specifications of k, I, r) when 
the informative prior was used. Let us first study the hyperparameter k. Its coordinates seem 
correctly estimated for the ideal situation where k^ = k^ = k^ — ku = 1 as illustrated in the top 
row of Figure |4] which shows the posterior error of k. When kj^ = ku = I and ka = kj = 0.5, the 
estimations obtained are of lesser quality as demonstrated in the bottom row of Figure |4] most of 
the seasonal similarity coefficients appear to be biased (while the posterior standard deviation on 
each coordinate, not shown here, are greater than in the ideal situation). These estimations may 
thus be used to quantify the closeness of the two datasets. 

The estimation of the hyperparameter / itself does not seem to provide a lot of information 
about the data: during our simulations, its mean value exhibited a lot of variability around the 
same value over the 300 replications for each of the five simulated scenarios and no reasonable 
conclusion could be drawn from it. 

On the other hand, the estimation of the hyperparameter q does reveal a bit of information 



13 



Figure 5: In grey: posterior mean of q (left) and r (right, on a log scale) for the informative prior 
(abscissas have been jittered a bit to prevent overlapping, and different shades of grey are used 
to indicate the level of the estimated density). 300 replications for each value of = kj tested. 
In black: the circles correspond to the averages, while the squares correspond to the 5% and 95% 
empirical quantiles. Here k^ = ku = 1- 

about the two datasets A and B. It is the mean of the coordinates of k on the real axis, as can 
be seen in the definition of the informative prior in ||5} on page |7] However its use remains 
somewhat limited in the sense that the parameters j6 of the two datasets are most often very close 
(meaning the coordinates of k that correspond to them is likely close to 1) while other parameters 
may vary greatly. Hence even though q provides information about the similarity between A 
and B, it carmot be interpreted alone and has to be considered jointly with r. The left part of 
Figure [5] shows the evolution of the posterior mean of q aska = ku ranges over [0.5, 1]. 

The estimation of the hyperparameter r (inverse-variance of the prior distribution on k, see 
(|5]l again) does in fact reveal some information about the two datasets too. It is a measure of 
dispersion of k around q, in the sense that the (higher it is, the closer to q the coordinates of k 
should be. Just like q is the mean of the coordinates of k, r is in fact their inverse-variance. The 
right part of Figure |5] shows a clear decline of r when fc^, = A:,, moves away from the ideal value 1 
i.e. when the similarity between the datasets A and B decrease from strong to weak. 

As we previously stated, the similarity between the two datasets has to be assessed simulta- 
neously with q and r and not q only: the mean q could be close to 1, possibly hinting at a perfect 
similarity between the two datasets, while the variance 1 / r could be great which would then 
indicate huge differences between the two estimated sets of parameters for the two datasets. 
Prediction. We compared the informative and the non-informative models using our comparison 
criterion defined in (13) and computing the ratio between the two models for different values of 
ka and fc-y, k^ and ku being both set to 1. The left part of Figure [6| shows the results we obtained 
for fcfl. and k^y simultaneously set to the values 1,0.95, 0.90, 0.80 and 0.50. Note that since the 
results appeared to be approximately symmetric with regard to 1 (i.e. for values 1,1.05,1.10,1.20 
and 1.50), we only included one side of the graph in the present article. 

On average, the Bayesian informative model is a clear improvement over the Bayesian 
non-informative one, its performances being maximised when the parameters r]'^ and rj^ are 
identical (which is the ideal situation). The performances in prediction are obviously somewhat 
weakened when the difference between the parameters r]-^ and grows greater, but the use 
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Figure 6: In grey: ratio between error predictions for the informative and the non-informative 
approach (abscissas have been jittered a bit to prevent overlapping, and different shades of 
grey are used to indicate the level of the estimated density). 300 replications for each value 
of fcfl = kj (left, where kj^ = ki, = 1) and ku (right, where ka, = k^ = kj = 1) tested. In black: 
circles correspond to the averages, while squares and diamonds correspond to the 80% and 90% 
empirical quantiles of these ratios. 

of the informative hierarchical prior still leads to an average improvement of 15% over the 
non-informative model, as can be seen on Figure |6] The results obtained when A:^ or A:„ are 
varying while the other coordinates of k are fixed to 1 were very similar (see for example the 
right part of Figure |6}. 

5 Applications 

The long dataset A needed for the construction of the informative prior corresponds to a 
specific population in France frequently referred to as "non-metered" because their electricity 
consumption is not directly observed by EDF but instead derived as the difference between 
the overall electricity consumption and the consumption of the "metered" population. For this 
population the data ranged from 07/01/2004 to 07/31/2010. 

We illustrate the benefit of choosing our informative prior to predict electricity load on 
short datasets. We consider two short datasets : the first B corresponds to the "non-metered" 
population for ERDF, a wholly owned subsidiary of EDF that manages the public electricity 
network for 95% of continental France. This population roughly covers the same people that A 
does, but not exactly. The second dataset B' corresponds to a subpopulation of A and represents 
around 50% of the total load of A. 

5.1 Benchmark against standard methods 

For this application, only the days for which no special tariffs are enforced were considered: the 
so-called EJP ("Effacement jour de pointe" = peak tariff days) were removed from the dataset 
beforehand to ensure the signal studied was consistent throughout time. Bank holidays (including 
the day before and the day after to avoid any neighbourhood contamination effects), the summer 
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Estimation A 


Estimation B 


Prediction B 


Case 1 


1099 


125 


28 


Case 2 


1099 


144 


38 



Table 1: Sample size (in days) of the datasets for both experiments. 







Case 1 




Case 2 




RMSE 


MAPE 


RMSE 


MAPE 


informative prior (I) 


770.71 


3.08 


2041.70 


5.71 


non-informative prior (NI) 


24440.52 


100.26 


7317.56 


22.03 


non-parametric (NP) 


1461.52 


5.82 


25091.68 


77.41 


double exp. smoothing (DES) 


1800.12 


7.41 


22572.58 


71.36 


linear time series (ARIMA) 


1702.39 


6.89 


13839.85 


43.73 



Table 2: Overall quality (RMSE in MW, MAPE in %) of the predictions for both experiments. 



holiday break (August) and the winter holiday break (late December) were also removed from 
the dataset for this first application, as we wanted to benchmark our method against others on a 
smooth and rather easy-going signal, so as not to put any one method at a disadvantage due to 
the signal's specificity. The temperature considered in the model is the average temperature over 
France for the period of study, and the cooling threshold was chosen to be 16A°C throughout the 
48 instants of the day. 

We benchmarked our Bayesian method with informative prior against four alternative meth- 
ods, comparing their predictions on dataset B in two configurations. Roughly speaking, for 
our first experiment we estimated the model for B over the period ranging from 12/01/2009 
to 06/30/2010 and predicted the next 30 days (same as the application presented Section 5.2 1, 
while for our second experiment, we estimated the model for B over the period ranging from 
01/01/10 to 07/31/10 and predicted the previous 30 days. We expect the first configuration to 
be the easy case and the second configuration to be the tough case, the signal being very smooth 
during summer and not so much during winter. The figures shown in Table [l] summarise the 
exact lengths of the various datasets for both experiments. 

The four alternative methods we benchmarked against our own Bayesian informative method, 
relied on four different techniques: the first was the Bayesian non-informative method that we 
exposed earlier in Section 3.2 (recall that it was meant to be an equivalent to the maximum 



likelihood approach), the second involved non-parametric estimation with kernels (see Fan and 



Viol [2005) , the third was a double exponential smoothing (see ITaylorl 12003) and the fourth and 
last was an ARIMA model. Note that for the second experiment, the data available obviously 
had to be time-reversed in order to apply some the last three alternatives methods since time- 
dependence plays an important role for them. The ARIMA model was automatically selected 
(see H3mdman and Khandakar |2008) and was the best model with regard to the AIC criterion. 

It is clear from the results exposed in Table |2] that the informative prior outperforms all the 
alternative methods by a large margin in each case. Figure [7] shows that the Bayesian informative 
method is superior to the Bayesian non-informative method, the non-parametric approach, the 
double exponential smoothing method as well as the ARIMA Model throughout the 48 instants 
of the day in both cases, with the exception of night-time for Case 1 where the non-parametric 
and ARIMA model remain competitive. This can be attributed to the nocturnal signal being very 
smooth in July, compared to the signal in winter. The overall bad performance of the Bayesian 
non-informative method is not surprising because at least 3 to 4 years of data are usually required 
to avoid overfitting, for such a parametric model. 
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case 1 



case 2 




Figure 7: Quality of the predictions (MAPE in %) averaged for each instant, for both experiments: 
case 1 (left) and case 2 (right). The ordinate axis is in log-scale. Each shade of gray corresponds 
to one of the 5 methods tested. 



5.2 Role of the hyperparameters 



For this application, the setup was the same as the one described at the start of Section 5.1 Our 
aim was to point out the role of the hyperparameters introduced within the informative prior 
and show that besides providing better results than alternative methods as was demonstrated in 



Section 5.1 they also provided a measure of similarity between the short datasets of interest and 
the dataset used to build the prior. We estimated the model for B and B' over the period ranging 
from 12/01/2009 to 06/30/2010 and predicted the next 30 days. 

Figure [s] shows the predictive quality of the model for both populations B and B' using the 
informative prior as well as the posterior densities of the similarity coefficients kj at midday (we 
also looked at the other 47 instants but the look of them was nearly identical to the one we chose 
to present). These densities are much more peaked and also centred closer to 1 for population B 
than they are for population B' . It thus seems to indicate that dataset B is more similar to A than 
B' is, confirming our prior knowledge that A and B covered approximately the same population 
whereas B' represented around 50% of A: this value of 50% is also visible on Figure [s] where we 
observe two densities centred around 0.5, which correspond to the similarity coefficients between 
the offsets coj of A and these of B'. 

Figure [9] displays the boxplots for the posterior densities of q and 1 / ^ and seem to corrobo- 
rate the fact that B is more similar to A than B' is. Recall that q and 1 / ^/r respectively act as the 
mean and standard deviation of the similarity coefficients kj within our informative hierarchical 
prior. Indeed the estimated mean of q appears to be closer to 1 while its estimated variance is 
smaller on B than B'. The estimated mean and variance of \l \pr are also smaller on B than B' . 

As we observed in Section |4] when we dealt with simulated datasets, the estimated values of 
q and r provide some information about the similarity between the datasets considered. Notice 
also that, here again, the best predictive performance is obtained when the similarity between 
the two datasets is strongest. 
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Figure 9: Boxplots of the posterior densities of q (left) and 1 / \/r (right) (mean and standard 
deviation of the similarity coefficients kj) for populations B and B', throughout the 48 instants of 
the day. 
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5.3 Role of the sample size 

For this application, the setup is almost identical to the one described at the start of Section 



5.1 but the temperature considered in the model is not the average temperature over France 
anymore but a transformation of it: it was smoothed using exponential smoothing, which is 
known to improve the link existing between the two variables temperature and electricity load 



(see [Bruhns et al. 2005 for more information about this). The cooling threshold was fixed at 
18A°C throughout the 48 instants of the day, and this time, the summer holiday break was not 
removed from the dataset (but the winter holiday break and the bank holidays still were), so that 
the model could benefit from (and be tested on) the August months in general. Note that for A 
we used the same dataset that we used for our two first applications. 

We put the focus on the length of the estimation period on B while keeping the same 
prediction window. We successively chose the periods ranging from 01/01/2010, 03/01/2010, 
05/01/2010, 07/01/2010 to 12/31/2010, reducing the estimation period on B from 12 months to 
only 6 months, removing 2 months at a time. The next 6 upcoming months were then predicted 



i.e. the prediction window ranged from 01/01/2011 to 06/30/2011. The diagram in Figure 10 
describes the 4 scenarios considered. 

The non-informative prior leads to a better fit than the informative prior as can be seen in 
Table |3] It should not come as a surprise because the non-informative prior was indeed meant 
to be equivalent to a maximum likelihood approach whose criterion is precisely to minimise 
the RMSE. As for the quality of the predictions associated with the model for both priors. 
Table |3] demonstrates that the informative prior beats the non- informative prior in each of the 
four proposed configurations. The improvement appears to be minimal when 12 months are 
used but as months are removed from the estimation window, the predictive quality for the 
non-informative prior drops very quickly, while the predictive quality for the informative prior 
remains moderate and stable. 



Figure 11 shows the average error in prediction for each month while Figures 12 and 13 
display the average error in prediction for each half hour (from 00:00 to 23:30). It is important to 
note that the use of the non-informative prior leads to overfitting the model: the results presented 
in Table |3] show that as the estimation window goes smaller, the estimation error decreases while 
the prediction error grows very quickly. A close inspection of the posterior densities of the 
different parameters of the model revealed that the bias induced by the increasing lack of data is 
mainly seasonal: this is due to the seasonality coefficients of the model being overfit. Choosing 
the informative prior over the non-informative prior makes the estimation and prediction of the 
model more robust with regards to the lack of data. 

The informative prior especially improves the quality of the predictions when the lack of data 
is severe: it provides reasonable forecasts even in the worst scenario considered here, where only 
6 months of estimation were used for 6 months of prediction. In this situation, estimation (from 
07/01/2010 to 12/31/2010) and prediction (from 01/01/2011 to 06/30/2011) are performed 
on non-overlapping areas of the calendar: the informative prior makes up for the unavailable 
data and prevents the model from overfitting on the second half of the calendar, while the 
non-informative prior does not and consequently leads to heavUy biased predictions over the 
first half of the calendar. 









estimation 


prediction 












10 months 












8 months 












6 months | 







6 months 
6 months 
6 months 
6 months 



jan. 2010 mar. 2010 may. 2010 jul. 2010 jan. 2011 jul. 2011 

Figure 10: Ranges of the estimation (from 12 to 6 months) and prediction (6 months) time- 
windows for the 4 scenarios considered. 
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Estimation Prediction 







RMSE 




MAPE 




RMSE 




MAPE 




non-info. 


info. 


non-info. 


info. 


non-info. 


info. 


non-info. 


info. 


12m. 


663.02 


671.95 


1.86 


1.87 


763.23 


737.83 


2.01 


1.94 


10m. 


606.04 


623.23 


1.78 


1.82 


1509.09 


883.07 


3.18 


2.21 


8m. 


473.29 


493.68 


1.49 


1.52 


8891.81 


1318.28 


16.72 


3.26 


6m. 


460.60 


499.13 


1.34 


1.44 


90356.82 


1305.27 


224.40 


3.62 



Table 3: Overall quality (RMSE in MW, and MAPE in %) of the estimation (left) and prediction 
(right) for the non-informative (non-info.) and informative (info.) priors, depending on the 
niimber of months used for the estimation (from 12 months to 6 months). 



Non -informative prior (Nl) 



Informative prior (1) 



■ 12m. □ 8m. 
□ 10m. □ 6m. 



m 



m 12m. □ 8m. 
□ 10m. □ 6m. 



■ill 



mm 



Jan. Feb. Mar. Apr May Jun. 



Jan. Feb. Mar. Apr. May Jun. 



Figure 11: Quality of the predictions (MAPE in %) averaged for each month (all the instants of the 
30 or so days within each month are used to compute these averages), with an estimation period 
ranging from 12 to 6 months using the non-informative prior (left) and using the informative 
prior (right). The ordinate axis is in log-scale. Each shade of gray corresponds to a different 
scenario. 
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Non-informative prior (Nl) 



12in. □ 8fn. 
10m. a 6m. 



1 2 3 4 5 






21 23 25 27 29 31 33 35 37 39 41 



43 45 47 



Figiire 12: Using the non-informative prior: quality of the predictions (MAPE in %) averaged 
for each instant (all the 180 or so days within the prediction time-window are used for those 
averages), with an estimation period ranging from 12 to 6 months. The ordinate axis is in 
log-scale. Each shade of gray corresponds to a different scenario. 



Informative prior (I) 



12m. 
10m. 



8m. 
6m. 



i 




1 2 3 4 5 



21 23 25 27 29 31 33 35 37 39 41 



Figure 13: Using the informative prior: quality of the predictions (MAPE in %) averaged for each 
instant (all the 180 or so days within the prediction time-window are used for those averages), 
with an estimation period ranging from 12 to 6 months. The ordinate axis is in log-scale. Each 
shade of gray corresponds to a different scenario. 
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6 Appendix 



The two MCMC algorithms presented below were developed because direct simulations from 
the posterior distribution were not possible. The justifications are given after the algorithms 
themselves. Notice that the full conditional distributions of all the parameters but the threshold u 
appear to be common distributions in both cases, due to the presence of multiple semi-conjugacy 



situations. We used a Metropolis-within-Gibbs algorithm (see Marin and Robert} 2007 page 96, 
for a quick description) based on Gibbs sampling steps for every parameter but u for which 
we used a Metropolis-Hasting step based on a Gaussian random walk proposal. The algorithm 
corresponding to the non-informative prior is detailed first since it is the simplest of the two. 



6.1 Technical Lemmas 

Definition 6 (Gaussian conjugacy operator). We define the (commutative and associative) operator * 
as 



^1 



¥2 
^2 



-i + 2:2-^]-i(2:rVi + S2"V2) 
p-i + s:,-!]-! 



for any vectors }i\ and }i2 in M.'^,for any symmetric positive definite matrices Si and S2 of size d x d. 
Lemma 7 (Conjugacy). Let Xi and X2 be two random truncated Gaussian vectors in 

Xi ~ A^(?ii,2:i,Si) 

and denote fi and f2 their respective densities, then /1/2 is integrable. Let furthermore Y be a random 
variable with density g{y) « f\{y)f2{y), ihen Y has truncated Gaussian distribution 



Y- A/'(f/,S:,SinS2) 



where 



^1 



F2 
S2 



and this result easily extends to any finite number of random truncated (or not) Gaussian vectors. 
Lemma 8 (Conditional distribution). Let Xbe a random Gaussian vector in R'' 



X 



X2 



Fi 

F2 



R S 
S' T 



and Xj and X2 the projections ofX over its di first and d2 last coordinates (d = di+ ^2). The conditional 
distribution of Xi with regard to X2 is then Gaussian 

Xi\X2^Afim-R'~^S{X2-}i2),R'^) 

Lemma 9. Let X and Y be two random vectors respectively in M.'^ and R" such as the conditional 
distribution ofY with regard to X is Gaussian 

Y\X M (z + MX,cr^ln 



with M matrix of size n x d that has full rank d < n, and let Z be a fixed vector in R". The conditional 
distribution of X with regard to Y is then Gaussian too 



X\Y ~ M (^[M'M]-'^M'{Y - Z),a^M' 



M 
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Proof. Denoting W = Y — Z, straightforward algebra leads immediately to 

{W - MXycr^I„{W - MX) = [{M'My'^M'W -X]'cr^M'M[{M'My'^M'W -X] 

- [iM'M)-'^M'W]'a^M'M[{M'M)-'^M'W] 

where the two last terms on the right hand side of the equation do not depend on X. □ 

6.2 MCMC algorithm for the estimation of the posterior distribution, using 
the non-informative prior 

In the lines below, we give the different steps of the MCMC algorithm we used to (approximately) 
simulate . . . , 9m) according to the posterior distribution Tz{6\y) corresponding to the non- 
informative prior we presented earlier. The algorithm goes as follows: 

Step 1. Initialise 6-[ such that Tz{6i\y) 7^0 

Step 2. For t = \, . . . ,M -1, repeat 

(i) . Simulate cr^_^_i cond. to {(X-t, ^t,Jt,Ut,y) i.e. 

7 /N 1 , 

o-lr-^Q[j,^\\y-fml 

(ii) . Simulate 7t+i cond. to {oct, f>t,Ut,a'^^yy) i.e. 
(iii) . Simulate fef+i cond. to {oLt,^t+liUt,crfj^yy) i.e. 

(iv) . Simulate cit+l cond. to {l}t+i,'yt+l/Ut,0'f^i,y) i.e. 

(v) . Simulate (^t --^ A/'(0, Smh) > simulate Vt'~^L{[0,l] and define Ut = Ut + St 
• define Mt_|_i = Ut if 

n{ut I oit+i, Pt+i, It+i, c^t+i/ y) 



Vt < 



n(ut I cit+i, ^t+i, 7t+i' c^t+i' y) 



or Mf+i = Mf otherwise 



where the covariance matrix Smh used in this last Metropolis-Hastings step is first 
estimated over a burn-in phase (the iterations coming from this phase are discarded), 
and then fixed to its estimated value "asymptotically optimally rescaled" for the final 
run by a factor (^^)^ (as recommended for Gaussian proposals in section 2 of 



and RosenthaIl|20nT) . 



Roberts 
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The justifications for each full conditional distribution used in the Gibbs sampling steps, 
including the explicit expressions of n^^^,!!^^^ ^f+i' now given. Lemma 

|9]is a key element to these justifications. 

Full conditional distribution of a. Denote n the size of the vector a, and 9\a the vector 9 from which 
the coordinates corresponding to a have been removed. The full conditional distribution of a can 
directly be deduced from both the prior and the likelihood contributions to it. 

Let us first observe that, since the prior distribution we are using on oc is flat, the full 
conditional distribution of a is in fact proportional to the likelihood function (seen as a function 
of a). Now considering the likelihood contribution, we write 

n{a\t]y,y) <x exp ( -^cr'^Wy - f{r])\ 



2 

Let now be the diagonal matrix whose diagonal coefficients are given by 

{U)tt = Bt,^ + Ct,t = l,...,N, 
let Zffi be the vector whose coordinates are given by 

(Z„)f = 7(T, - m)1[7-^^+^[(m), f = 1, . . . ,N, 
and denote M« the matrix Ma = L^A. We can now rewrite p and get 



n{(x.\0\oi,y) (X exp (^-^'^ ^Wv ~ (2^* + MaOc)\\l^ 



Using Lemma |9| it is then straightforward to see that the full conditional distribution of a is 
Gaussian 

a|0\a,y ~A^(f/M:«) (14) 

where 

^ / [M'M-^M',{y-Z,) 

Full conditional distribution of /5. Using similar arguments, we obtain the full conditional distribu- 
tion of j6. Namely, denoting Z^ the vector whose coordinates are given by 

(Z/3), = (Aa)tCf + 7(T, - m)1[7-,,+^[(m), 

and calling = L^B where is the diagonal matrix whose diagonal is Act, we obtain the 
truncated Gaussian distribution 

/3|0\/5,y~A^(/,E^Bj(O,l)) (15) 

where 

\ ^ / [M'^Mf,]-^M'i,{y-Zf,) \ 



Full conditional distribution ofy. Using again similar arguments, we obtain the full conditional 
distribution of 7. Namely, denoting Zy the vector whose coordinates are given by 

{Zy)t = {Aa)t{{Bp)t + Ct), 
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and calling Mj the vector whose coordinates are (Tf — u)l^j^^^^^{u) we obtain the Gaussian 
distribution 

j\e\j,y^M{^'r,i:'r) (16) 

where 

^7 \ ^ / [M'^Mj]-^M'^{y-Zj) 

Full conditional distribution of cP-. No calculations are required, as we unmediately identify an 
inverse-gamma distribution from |[7|. 

6.3 MCMC algorithm for the estimation of the posterior distribution, using 
the informative prior 

In the lines below we give the different steps of the MCMC algorithm we used to (approxi- 
mately) simulate (flj, . . . , 0^4 ) according to the posterior distribution T[{6\y) corresponding to the 
informative prior we presented earlier The algorithm goes as follow: 

Step 1. Initialise 6^ such that n{6i\y) 7^0 
Step 2. For t = \, . . . ,M - I, repeat 

(i) . Simulate u^^j cond. to {(X-t, ^tiJtiUt,kt,li,qi,rt,y) 

^li-Tg{^,\\\y-f{r^)\\l 

(ii) . Simulate rf+i cond. to {oct, f>t,lt,Ut,crf^^,kt,lt,cit,y) 

( d 1 
n+l ^Q\ar + -, hr + - Y,(ki - qY 

(iii) . Simulate qt+i cond. to [oct, ^t,yt,Ut,crf^i,kt,ltJt+l,y) 

d 

E 

i=l 

(iv) . Simulate /f+j cond. to {dt, fit,l't,Ut,of^^,kt,qt-\-'i,Tt+l/y) 



qt+i ~ + rd] -1 ((7-2 + r ^ k,), + rd\ 



d , 1 

(v) . Simulate fcf+i cond. to {oct, fit,7t,Ut,crf_^_i,h+l,qt+l,rt+i,y) 



(vi) . Simulate ^t+l cond. to {a^t, f!'t,Ut,crf^i,kt+i,lf+i,qt+i,rt+i,y) 



25 



(vii) . Simulate cond. to ioit,Jt+l,Ut,crf_^i,kt+i,lf+i,qt+i,rt+i,y) 

(viii) . Simulate at+i cond. to (i6t+i,7f+l/"t/C^f^+i/^f+l/'t+l/^f+l/''f+l/3/) 

af+l~A/-(//?+i,2:?+i) 

(ix) . Simulate (5t --^ A/'(0, Zmh)/ ft ~ W[0, 1] and define Ut = Ut + 3t 

• define Hf+i = Uf if 

n{ut\at+i, /3t+i, 7f+i, cr2_^, fcf+i, U+i, qt+i, rt+i, y) 

• or Mf+i = Mf otherwise 

where the covariance matrix Smh used in the Metropolis-Hastings step is first estimated over 
a burn-in phase, and then fixed to its rescaled estimated value for the real run as in the non- 
informative approach. 

The justifications for each full conditional distribution used in the Gibbs sampling steps, 

including the explicit expressions of and I^f^^, are now 

given. To derive these full conditional distributions, we will make use of the technical Lemmas [7j 
|8] and |9] presented earlier. 

Full conditional distribution of a. Denote n the size of the vector a, and 6\a the vector 6 from 
which the coordinates corresponding to a have been removed. The full conditional distribution 
of a can directly be deduced from both the prior and the likelihood contributions to it. Denote 
9^ = {6,k, l,q, r), and write the full conditional distribution of a. as 

Ti{oL\e^\oi,y) oi gi{a)gp{a.) 

where giioi-) is the contribution of the likelihood (seen as a function of a to the full conditional 
distribution) and gp{c() is the contribution of the prior (seen as a function of a). We prove that gi 
and gp both correspond to Gaussian distributions before using Lemma [7| to combine them into 
yet another Gaussian distribution. 

1. Let us first consider the prior contribution gp. Recall first that a only appears in the 
following component of the prior 

n{e\k,l) a /z exp (^-^(0 - Kfi^yiiH-^y^e - K}i^)^ , 

which directly implies that 

gpia) a exp (^-1(0 - Kfi^yiiZ^rHe - Kfi^)^ . 

Denote }i — Kji-^, Z = Z^^Z-^ and denote jia. and /i^y the vectors resulting from the 
extractions of the coordinates corresponding to a and t]\ix. from }i. Finally denote R^ua) 
matrix resulting from the extraction of the rows and colunms both corresponding to a of 
and denote S(„ the one resulting from the extraction of the rows corresponding 
to a and columns corresponding to f]\a oi T,^^ . Using Lemma [t] (and reordering indexes 
if necessary) it is straightforward that gp{c() is proportional to the density of a Gaussian 
distribution 
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2. Let us now consider the likelihood contribution. Using exactly the same notations that 
we used for the full conditional distribution of a for the algorithm associated with the 
non-informative approach we immediately find that giioi-) is proportional to the density of 
a Gaussian distribution 

just as in |[T4|. 



3. With the help of Lemma[7|and using the two results above, we can now deduce the posterior 
conditional distribution of a. and obtain the Gaussian distribution 



where 



Full conditional distribution of /5. Using similar arguments, we obtain the full conditional distri- 
bution of j6. Namely, keeping the notation introduced to derive | (T5) , and combining the prior 
and the likelihood contributions together with Lemma [7] we obtain the truncated Gaussian 
distribution 



/3|0A/5,y-A^f/,5:^Bt''(O,l) 



where 



Full conditional distribution of 7. Using again similar arguments, we obtain the full conditional 
distribution of 7. Namely, keeping the notation introduced to derive | (T6) , and combining the 
prior and the likelihood contributions together with Lemma [7]we obtain the Gaussian distribution 

where 

^7 \ _ / ^^-R-4)S(^,^\^)(;?\7-F,,\-,) \ / [M\M^]-^M'^{y-Z^) 

Full conditional distribution ofk. We denote M-^ = diag(f/'^) and first notice that M-^k = Kji-^. 
Using the definition of the informative prior and Lemma we then immediately derive 

k\e,\k,y ^ ^ 

where 

F'\_f ^(1 \j (M^r'v 

1:'' )~[ r-^k ) [ l-^{{M^)-^i:-^{M^)-^} 

Full conditional distribution ofl,q,rand (p-. No calculations are required, as we respectively identify 
a gamma distribution, a Gaussian distribution, a gamma distribution, and an inverse-gamma 
distribution from ||6|. 



27 



Acknowledgments 



The authors would like to thank Adelaide Priou for collecting a part of the data as well as the 
corresponding results, and Virginie Dordonnat for the insightful discussions. 

References 

Al-Zayer, J. and Al-Ibrahim, A. (1996). Modelling the impact of temperature on electricity 
consiraiption in the eastern province of saudi arabia. Journal of Forecasting, 15:97-106. 

Bruhns, A., Deurveilher, G., and Roy, J. (2005). A non-linear regression model for mid-term load 
forecasting and improvements in seasormality. Proceedings of the 15th Power Systems Computation 

Conference 2005, Liege Belgium. 

Bunn, D. and Farmer, E. (1985). Comparative models for electrical load forecasting. 

Cottet, R. and Smith, M. (2003). Bayesian modeling and forecasting of intraday electricity load. 

Journal of the American Statistical Association, 98(464):839-849. 

Cugliari, J. (2011). Prevision non parametrique de processus a valeurs fonctionnelles. Application a la 
consommation d'electricite. PhD thesis, Universite Paris Sud XI. 

Dordonnat, V. (2009). State-space modelling for high frequency data. PhD thesis, Vrije Universiteit 
Amsterdam. 

Dordonnat, V., Koopman, S., Ooms, M., Dessertaine, A., and Collet, J. (2008). An hourly 
periodic state space model for modelling french national electricity load. International Journal of 
Forecasting, 24(4):566-587. 

Engle, R., Granger, C, Rice, J., and Weiss, A. (1986). Semiparametric estimates of the relation 
between weather and electricity. Journal of the American Statistical Association, 81:310-320. 

Fan, J. and Yao, Q. (2005). Non linear Time Series: Nonparametric and Parametric Methods. Springer. 

Gelman, A. and Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. 
Cambridge University Press. 

Goude, Y. (2008). Melange de predicteurs, application a la prevision de consommation d'electricite. PhD 
thesis, Universite Paris Sud XI. 

Harvey, A. C. and Koopman, S. J. (1993). Forecasting hourly electricity demand using time- 
varying splines. Journal of the American Statistical Association, 88:1228-1237. 

Hippert, H., Bunn, D., and Souza, R. (2005). Large neural networks for electricity load forecasting: 
Are they overfitted? International Journal of Forecasting, 21:425-434. 

Hyndman, R. J. and Khandakar, Y. (2008). Automatic time series forecasting: the forecast package 
for R. Journal of Statistical Software, 88(3). 

Launay, T., Philippe, A., and Lamarche, S. (2012). Consistency of the posterior distribution and 

MLE for piecewise linear regression. Preprint. arXiv:1203.4753. 

Marin, J.-M. and Robert, C. (2007). Bayesian Core : A Practical Approach to Computational Bayesian 
Statistics. Springer. 

Menage, J. P., Panciatici, P., and Boury, F. (1988). Nouvelle modelisation de I'influence des 
conditions climatiques sur la consommation d'energie electrique. Technical report, EDF R&D. 



28 



Ramanathan, R., Engle, R., Granger, C, Vahid-Araghi, F., and Brace, C. (1997). Short-rtm forecasts 
of electricity loads and peaks. International Journal of Forecasting, 13:161-174. 

Roberts, G. and Rosenthal, J. (2001). Optimal scaling for various metropolis-hastings algorithms. 

Statistical Science, 16(4):351-367. 

Seber, G. and Wild, C. (2003). Nonlinear Regression. Wiley. 

Smith, M. (2000). Modeling and short-term forecasting of new south wales electricity system 
load. Journal of Business & Economic Statistics, 18:465-478. 

Smith, M. and Kohn, R. (2002). Parsimonious covariance matrix estimation for longitudinal data. 
Journal of the American Statistical Association, 97:1141-1153. 

Soares, L. and Medeiros, M. (2008). Modeling and forecasting short-term electricity load: a 
comparison of methods with an application to brazilian data. International Journal of Forecasting, 
24:630-644. 

Taylor, J. (2003). Short-term electricity demand forecasting using double seasonal exponential 
smoothing. Journal of the Operational Research Society, 54:799-805. 

Taylor, J. and Buizza, R. (2003). Using weather ensemble predictions in electricity demand 
forecasting. International Journal of Forecasting, 19:57-70. 

Taylor, J., De Menezes, L., and McSharry, R (2006). A comparison of imivariate methods for 
forecasting electricity demand up to a day ahead. International Journal of Forecasting, 22:1-16. 

Taylor, J. and McSharry, P. (2007). Short-term load forecasting methods: An evaluation based on 
euxopean data. IEEE Transactions on Power Systems, 22:2213-2219. 



29 



